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Motivation: The solution of high-dimensional inference and prediction problems in computational bi- 
ology is almost always a compromise between mathematical theory and practical constraints such 
as limited computational resources. As time progresses, computational power increases but well- 
established inference methods often remain locked in their initial suboptimal solution. 
Results: We revisit the approach of Segal et al. (2003) to infer regulatory modules and their condition- 
specific regulators from gene expression data. In contrast to their direct optimization-based solution 
we use a more representative centroid-like solution extracted from an ensemble of possible statistical 
models to explain the data. The ensemble method automatically selects a subset of most informative 
genes and builds a quantitatively better model for them. Genes which cluster together in the majority 
of models produce functionally more coherent modules. Regulators which are consistently assigned to 
a module are more often supported by literature, but a single model always contains many regulator 
assignments not supported by the ensemble. Reliably detecting condition-specific or combinatorial 
regulation is particularly hard in a single optimum but can be achieved using ensemble averaging. 
Availability: All software developed for this study is available from http://bioinformatics.psb. 
ugent .be/software/ 

Supplementary information: Supplementary data and figures are available from http:// 
bionf ormatics .psb. ugent . be/supplementary_data/anjos/modulejnets_yeast/ 



I. INTRODUCTION 

One of the central goals of the top-down approach to 
systems biology is to infer predictive mathematical net- 
work models from high-throughput data. Much of the 
driving force for the development of network inference 
methods has come from the availability of various types 
of large-scale data sets for particular model organisms 
like S. cerevisiae and E. coli. In contrast, data generation 
for other organisms has been much slower and mainly 
focused on gene expression data. These gene expression 
data sets for typically more complex organisms pose 
their own challenges, such as a higher number of genes, 
limited number of experimental conditions, and sup- 
posedly a more complex underlying transcriptional net- 
work. Therefore, improvement and refinement of meth- 
ods for network inference from gene expression data 
continues to be of great interest. Several reviews on a 



variety of methods have been written ( Bansal et al. 2007 
Bussemaker et al.\ |2007{ |Friedman| |2004| |Gardner anc 



Faith) |2005) , and development of new methods remains 
an active area of research ([Alter and Golub , 2005; Basso 
eFaTl[2005}[Bonneau et gT)|2006) |Faith et al.\pM7). Here 
we revisit the module network method of (|Segal et~aC 



2003) to infer regulatory modules and their condition 



specific regulators from gene expression data and show 
that better and more refined module networks can be 
obtained by using advanced statistical and computa- 



Monte Carlo dLiu) |2004) and ensemble strategies 
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Following (|Hartwell et 


il. 1999|) a 'module' is to be 



viewed as a discrete entity composed of many types of 
molecules and whose function is separable from that of 
other modules. Understanding the general principles 
that determine the structure and function of modules 
and the parts they are composed of can be considered 
one of the main prob lems of contemporary systems biol- 
og y (|Hartwell et gl) |1999||. The module network method 



of ( Segal et al. 2003 addresses this problem using gene 



expression data as its input. It has yielded novel biolog 
ical insigh ts in a number of co mplex eukaryotic sy stems 



(|Lee et gZ.| 12006) |Li et al] [2007} |Novershtern et g/!]|2008 
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Segal et al. |2003))2007[Zhu et gl.}|2007) andhas been the 
source of inspiration for numerous computational ap- 
proaches to network inference as evidenced by its high 
number of citations. A module network is a probabilis- 
tic graphical model (Friedman, 2004) which consists of 
modules of coregulated genes and their regulatory pro- 
grams. A regulatory program uses the expression level 
of a set of regulators to predict the condition depen- 
dent mean expression of the genes in a module. (Segal 
et gZ.||2003) used a deterministic optimization algorithm 
that searches simultaneously for a partition of genes into 
modules and a regulation program for each module. We 
consider both as separate tasks. When searching for 
modules, often many local optima exist with partially 
overlapping modules differing from each other in a few 
genes. We use a Gibbs sampling approach for two-way 
clustering of genes and conditions to generate an en- 
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semble of partially overlapping partitions of genes into 
modules a nd pro duce an ensemble averaged solution 
floshi et oT||2008) . This centroid solution consists of so- 
called tight clusters, subsets of genes which consistently 
cluster together in almost all local optima. We also use a 
probabilistic method for learning regulatory programs. 
These regulatory programs take the form of fuzzy de- 
cision trees with regulator expression levels at the deci- 
sion nodes and gener alize the regression tree approach 
of ( |Segal et~aL 2003) . By summing the strength with 
which a regulator participates in each member of an en- 
semble of regulatory programs for a certain module, we 
obtain a regulator score which gives a statistical confi- 
dence measure for the assignment of that regulator. To- 
gether, the Gibbs sampling cluster algorithm and proba- 
bilistic regulatory program learning provide a computa- 
tionally efficient method to generate ensembles of mod- 
ule networks from which a centroid-like summarization 
can be constructed. 

We have applied this ensemble method to the very 
same data set as < |Segal etal 2003} and performed sev- 
eral comparison tasks. First, we considered the proba- 
bilistic models and evaluated them on training as well 
as test data. We show that the model inferred by (Segal 
et al. , 2003 ) is equivalent to a single instance of the en- 
semble of models inferred by our algorithm. The tight 
clusters obtained from the ensemble solution generate a 
quantitatively better model than each of the single in- 
stances, including the model of (Segal et al] 120031. Sec- 
ond, we compared the clustering of genes. Tight clus- 
ters are in general more functionally coherent and im- 
prove the original modules in two ways. They can re- 
move spurious profiles and fetch only the core of tightly 
coexpressed genes from a single module, or they can 
merge separate but related modules into one cluster. 
Third, we used the regulator score to analyze the net- 
work of modules and their associated regulators from 
(Segal et al. 2003) . We show that this network contains 
both high- and low-scoring regulators and that several 
high-scoring regu lators are missed by the solution of 
(Segal et al. 20031. In general, regulator assignments 
which can be validated by external sources such as ChIP 
data or literature are highly ranked. In combination 
with the tight clusters, the probabilistic method assigns 
more regulators supported by literature and the clus- 
ters to which they are assigned contain a higher ratio of 
known t argets compared to the module network of ( |Se-| 
gal et al. , 2003 1. Fourth, we show that the regulator scor- 
ing scheme can also be used to infer context-specific and 
combinatorial regulation by identifying pairs of regula- 
tors which occur significantly often together in the same 
regulation program. 

Finally we have applied the ensemble method to 
a bHLH module network that was r ecently in ferred 
for mouse brain \U et al.\ [20071 - ( |Li et al] [2007} 
used their module network to make several hypothe- 
ses about modes of combinatorial regulation among dif- 
ferent brain tissues. We show that only few of these 



hypotheses are statistically supported by the ensemble 
method. This example illustrates the usefulness of an 
approach which can generate internal significance mea- 
sures, in particular if no other data sources are available 
to validate hypotheses generated by a single local opti- 
mum. 

Together all these results convincingly show that the 
ensemble method for learning module networks signif 
icantly improv es the direct optimization method of ( Se 
gal et al. |2003) . Unlike a single optimum, ensemble av- 
eraging allows the assessment and prioritization of the 
statistically most reliable modules and their condition- 
specific regulators. Such high-confidence modules can 
be used directly for generating experimentally verifi- 
able hypotheses or can be integrated with other, perhaps 
smaller-scale, data sources to create a more comprehen- 
sive view of the underlying networks. 



II. RESULTS AND DISCUSSION 

A. Data and procedure 

We obtained all data from the supplemental website 
of ( |Segal et al. 2003) , including expression data, gene 
modules and regulatory programs. Using the Gibbs 
sampler we generated 12 different partitions of genes 
into modules which were combined into one set of tight 
clusters. The number of clusters is determined auto- 
matically by the Gibbs sampler and ranges from 65 to 
78 in the different runs, co mpare d to the predefined 
value of 50 of ( |Segal et al] [2003) . 1892 of the 2355 
genes in the data set could be assigned with high con- 
fidence to 69 tight clusters. To generate regulator as- 
signment scores, we learned 10 probabilistic regulation 
programs per module with 100 regulator and split value 
pairs sampled per regulation program node. More de- 
tails about these procedures are given in the Methods. 
This resulted in four different module network models: 



1. SCSR: Segal clusters with Segal regulation pro 
grams, c orresponding exactly to the results of (jSe 
gal et gZ.[[2003|. 



2. SCPR: Segal clusters with probabilistic regulation 
programs. 

3. GCPR: Gibbs sampler clusters (single run) with 
probabilistic regulation programs. 

4. TCPR: Tight clusters (multiple Gibbs sampler runs 
combined) with probabilistic regulation programs. 



B. Model evaluation 

A module network infers a probabilistic model which 
explains relations between expression levels of a set 
of genes. More precisely, there is a probability dis- 
tribution p(x\, . . . , Xn) which computes the probability 
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FIG. 1 Model evaluation experiments, (a) Histogram of L = 
log p{x\, . . . , xjv) for SCPR (blue) and non-parametric fit of 
the histogram for GCPR (black curve), (b) Histogram of t 
for SCSR (red) overlayed on histogram of L for SCPR (blue), 
with non-parametric fits (black curves), (c) Histogram of t 
for SCPR (blue) overlayed on histogram of L for TCPR (ma- 
genta), with non-parametric fits (black curves), (d) Histogram 
and non-parametric fit (left black curve) of L for GCPR learned 
on training data and evaluated on test data (green) and non- 
parametric fit of the same models evaluated on training data 
(right black curve). All histograms and curves are normalized 
to have area equal to 1. 



(density) to observe a particular combination of expres- 
sion levels Xj for a set of N genes. This probabilis- 
tic model predicts the response in expression of genes 
in a module upon perturbations of its regulators, such 
as knock-out or overexpression, and thus yields bio- 
logically verifiable hypotheses. For a module network, 
the distribution p{x\,..., xjj) is a product of N factors 
(see Methods), so we consider the normalized quan- 
tity t = Jrlogp(xi, ...,Xj,f) which can be compared 
between models with potentially different numbers of 
genes. Higher values of t mean better explanation of 
the data by the model, i.e. more accurate prediction of 
the outcome of new experiments. 

First we performed evaluations on each of the condi- 
tions in the original data set. Figure [T] (a) shows that the 
histogram of t-values for SCPR fits well within a non- 
parametric curve fit of the histogram for GCPR. This 
implies that the clusters found by ( |Segal efaT. |2003) are 
equivalent to one local optimum identified by me Gibbs 
sampler procedure. Figure[T](b) shows the histogram of 
t-values for SCSR (red) overlayed on the histogram for 
SCPR (blue), both with non-parametric curve fits. The 
mean t-values obtained by SCPR are higher than SCSR 
by a one-tailed f-test (a = 0.01) proving that probabilis- 
tic regulation programs give a better explanation of the 
data. In Figure [l](c) we compared SCPR to TCPR. TCPR 
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FIG. 2 Histogram of the highest fraction of genes in one mod- 
ule in a MIPS functional category for TC (red) and SC (blue), 
sorted by ratio difference. 



has a higher mean t than SCPR with a one-tailed t-test 
(a = 0.01). This shows that tight clusters are selecting 
a subset of genes which are the most informative and 
therefore generate a better model. 

Next we tested how well these models explain unseen 
data by performing a cross-validation experiment. We 
removed 10% of the conditions at random from the com- 
plete data (the test set) and ran the Gibbs sampler once 
on the remaining 90% (the training set). The resulting 
model was then evaluated on the test set. This proce- 
dure was repeated 10 times and all test set evaluation 
values were collected in one histogram and compared 
to the training set values (Figure[T|(d)). The curve of the 
test set is slightly shifted to the left with respect to the 
training set curve, as one would expect, but both curves 
have the same mean with a one-tailed f-test (a = 0.01). 
This shows that the probabilistic models indeed gener- 
alize to unseen data. 



C. Gene clustering improvement 

We have shown in the previous section that SCPR is 
equivalent to GCPR but TCPR gives a better model over 
SCPR. We also observe that tight clusters (TC) are over- 
all more functionally coherent than the clusters obtained 
in (Segal et al. 20031 (SC). Figure [2] shows the fraction of 
genes in a cluster belonging to a MIPS functional cate- 
gory which is significantly overrepresented (p < 0.001) 
in SC and TC. Several examples illustrate the general 
trend seen in this figure. In TC-40, 4/7 genes are in- 
volved in amino acid transport compared to SC-27 with 
8/53 genes. In TC-27, 7/9 genes belong to purine nu- 
cleotide anabolism compared to SC-11 with 6/53 genes. 

Segal cluster 1 (SC-1) contains 55 genes, out of these 
32 (58%) are validated targets of Hap4, a global regula- 
tor of res piratory genes, according to the YEASTRACT 
database ( |Teixeira et al. 2006| . This cluster has maxi- 




FIG. 3 TC-7 with Hap4 assigned as a top regulator. Genes 
known to be regulated by Hap4 in YE ASTRACT are marked in 
blue and those involved in respiration are marked in orange. 
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FIG. 4 (a) TC-27 fetches the core of tightly coexpressed genes 
from SC-11; 67% genes in TC-27 are known to be Basl targets, 
(b) SC-8 and SC-9 which have similar expression are merged 
into TC-11. SC-8, SC-9 and TC-11 all are enriched for Gatl 
targets. 



were separate in SC (Figure|4](b)). 



mum overlap with tight cluster 7 (TC-7) with 30 genes 
out of which 25 (83%) are known Hap4 targets. The 
five remaining genes are Qcr6, Cox5a and Fuml, all 
located in mitochondrion and involved in respiration, 
and two unknown genes Ygll88c and Ygrl82c. With 
24 / 30 resp iratory gene s (80%), TC-7 even improves on 
COGRIM < |Chen et al] \2007) which combines multiple 
data sources. Using expression data alone (the same 



data set as ( |Segal et flZ.T|2003| ), ( |Chen et gj.]|2007> obtain 
a cluster with 32/51 (62%) genes belonging to MIPS res- 
piration category. Using both ChIP and expression data, 
they obtain a cluster with 23 / 34 (68%) respiratory genes, 
significantly lower than TC-7. Figure [3] shows TC-7 with 
known Hap4 targets and respiratory genes marked in 
blue and orange respectively. 

TC-27 contains nine genes which form a subset of SC- 
11 containing 53 genes (Figure [4] (a)). Six genes (67%) 
in this cluster are known Basl targets compared to only 
18% Basl targets in SC-11. TC-28 and TC-37 contain 70% 
and 100% known targets of Msn4. These clusters have 
a large overlap with SC-3 and SC-41 respectively, which 
have 55% and 93% known targets of Msn4. TC-1 con- 
sists of 51 genes, out of which 28 (55%) are known to 
be Swi4 targets. This module merges genes from SC- 
10, 29 and 30. They have 4/37 (11%), 19/41 (46%) and 
8/30 (27%) Swi4 targets respectively. TC-11 contains 
genes of SC-8 and SC-9 whose highest ranked regula- 
tor is Gatl (see Section[lLD] (Figure[I](b)). YEASTRACT 
data confirms 17% of these targets, while for SC-8 and 
9 overall 15% targets are confirmed by YEASTRACT. 
TC-35 is overrepresented for genes involved in RNA ex- 
port from nucleus (p-value 10 -8 ). It overlaps with SC- 
19, 31 and 36 (p-values ~ 10 -3 ). TC-31 contains genes 
mainly involved in ribosomal biogenesis (p-value 10~ 13 ) 
and combines relevant genes from SC-13, 14 and 15 (p- 
values ~ 10 -4 ). 

We conclude that tight clusters improve clustering re- 
sults obtained by (Sega i~ef aZ.j |2003) in two ways. They 
can fetch only the core of tightly coexpressed genes from 
a SC (Figure [4] (a)), or they can merge clusters which 



D. Regulator assignment prioritization 

The ensemble approach generates multiple equally 
plausible regulatory programs for a single module in 
a probabilistic fashion. The regulator assignment score 
which takes into account how often a regulator is as- 
signed to a module, with what score, and at which level 
in the regulation tree, can therefore be used to prioritize 
regulators (highest regulator score gets topmost rank). 

First we consider only the difference between prob- 
abilistic regulator assignment and the original method 
by comparing SCSR with SCPR, hence keeping the gene 
modules the same for both methods. Figure [5] shows 
regul ator-module links in SCSR (cfr. Figure 5 in (Segal 
et al. 2003) ). The edges colore d red are the ones sup- 
ported by literature (data from (Segal et al. 20031). To 
each edge we add the rank with which it is assigned 
in SCPR. Regulator-module links supported by litera- 
ture have often a higher rank. SCSR assigns Hap4, a 
global regulator of respiratory genes, to SC-1. This clus- 
ter contains 58% known Hap4 targets and Hap4 has 
second highest rank in SCPR. SCSR also assigns Hap4 
to SC-10 which contains genes involved in amino acid 
metabolism. SC-10 has only 2/37 (5%) known Hap4 
targets according to YEASTRACT and this assignment 
is ranked very low (rank 73) in SCPR. Several high- 
ranking SCPR assignments which were missed by SCSR 
could also be validated using (Harbis on et al.\ |2004[ ) 
data (p-value < 0.005). We assign Gal80, a transcrrp- 
tional regulator involved in the repression of Gal genes 
in the absence of glucose, with second rank to SC-6. 
This is a cluster of four Gal genes, Gall, Gal2, Gal7 and 
GallO. Met32, a zinc -finger DNA-binding protein in- 
volved in transcriptional regulation of the methionine 
biosynthetic genes assigned with third rank to SC-8, and 
Gisl, a histone demethylase assigned to SC-3 with 5th 
rank, are supported by YEASTRACT (respectively 5/29 
and 6/31 known targets). 

Next we compared TCPR with SCSR to analyse the 
combined improvement made by ensemble averaging 
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at the level of gene clustering as well as at the level of 
regulator assignment. For TCPR, we selected the top six 
regulators for each cluster. This rank cutoff was deter- 
mined as follows. We computed the significance for the 
overlap between each tight cluster and each transcrip- 
tion factor target set using the YEASTRACT database. 
A reference module network was formed by keeping 
all transcription factor - tight cluster edges below a cer- 
tain p-value cutoff. By comparison with this reference 
network we found that a rank cutoff of six gives the 
best overall F-measure score at different p-value cutoffs 
(see Supplementary information). A similar analysis for 
SCSR shows that the F-measure for TCPR is consistently 
higher (see Supplementary information). To compare 
TCPR and SCSR in more detail, we identified for each 
regulator the cluster with the highest fraction of known 
targets in YEASTRACT. Likewise we find the best clus- 
ter for each regulator in SCSR. Figure |6]shows that TCPR 
assigns more regulators supported by YEASTRACT and 
also that the clusters contain a higher ratio of known tar- 
gets. There are six regulators assigned by both methods, 
four of which HAP1, GAT1, TOS8 and XBP1 all are as- 
signed to clusters more enriched in their known targets 
in the TCPR solution. 



E. Context-specific and combinatorial regulation 



(Segal et al. 20031 used a decision tree approach to 
model regulatory programs because it can represent, 
at least in principle, context-specific and combinato- 
rial regulation. In the ensemble language, context- 
specificity means a regulator gets a high overall score by 
being assigned consistently to a lower, non-root level in 
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FIG. 6 Histogram of the highest fraction of known targets of 
transcription factors in a module using TCPR (red) and SCSR 
(blue) according to the YEASTRACT database. The rank with 
which a regulator is assigned to a module in TCPR is indicated 
in brackets. 



the set of decision trees for a certain module. In SCPR, 59 
regulator assignments divided over 39 (out of 50) mod- 
ules have a significant score contribution (value > 100) 
from a non-root level (see the Methods for the decom- 
position of the score function over different tree levels). 
Combinatorial regulation means two (or more) regula- 
tors are consistently assigned together at different levels 
in all decision trees. Although this form of combina- 
torial regulation may correspond to genuine biological 
combinatorial regulation, we take a strictly data driven 
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definition here: combinatorial regulation in the decision 
tree sense means the expression levels of both regula- 
tors are needed together to explain the expression level 
of the module ('AND' regulation). Alternatively, two (or 
more) high-scoring regulators may achieve their high 
rank from the same decision tree level (usually the root 
level). In this case both regulators explain the module 
equally well alone ('OR' regulation). In SCPR, there are a 
total of 100 regulator assignments with significant score 
contribution from the root level (OR regulation), which 
can be combined with the 59 assignments at level 1 for 
potential AND combinatorial regulation. 

Only few of the significant AND combinatorial regu- 
lation pairs are present in the single-optimum solution 
of SCSR (see edge ranks in Figure pi. SC-47 has Gcn20 
as the highest ranked regulator at level and Cnbl at 
level 1, and both assignments are supported by liter- 
ature. SC-36 has two validated regulators Gcn20 and 
Not3 ranked first and third respectively in SCPR, but the 
score of Not3 is low and not deemed significant. SC-4 is 
an example of OR regulation wrongly assigned in SCSR. 
In SCSR, Ypl230w is assigned at level and Gael at level 
1, but in SCPR both are assigned at level with first and 
third rank respectively and no high-scoring regulator is 
found at level 1. Some of the AND combinatorial reg- 
ulation pairs in SCPR that were missed in SCSR can be 
validated by YEASTRACT. SC-40 has Tos8 assigned at 
level (overall rank 1) and Yapl at level 1 (overall rank 
2). Tos8 has 3/15 known targets in this module while 
Yapl has all known targets (15/15). SC-26 has Gael at 
root level (overall rank 1) and Mall3 at level 1 (overall 
rank 2). Mall3 has two known targets (out of six known) 
in SC-26. 

Due to the high number of possible regulator combi- 
nations, identifying statistically significant regulation of 
AND-type is an even more complex problem than sim- 
ple regulator assignment. These examples show that 
also for this problem, the ensemble approach is well 
suited. 



F. Module network in mouse brain 



Recently, ( |Li et al. |2007 1 reconstructed a bHLH tran- 
scription factor regulatory network in mouse brain by 
a direct application of the method of (Segal et al. 2003 1. 
They selected a small data set of 198 genes and 22 con- 
ditions, built a module network using 22 bHLH tran- 
scription factors as candidate regulators and assigned 
15 different regulators to 28 modules (denoted again by 
SC), out of which 12 (43%) have at least two genes in the 
same GO category. Based on the co-occurence of regula- 
tors in the regulation programs of individual modules, 
( |Li et al. [2007) make hypotheses about different modes 
of coregulation among brain tissues which are currently 
not confirmed by other data sources. We applied the en- 
semble method on this data set and got 17 tight clusters 
(denoted by TC), out of which 11 (65%) have at least two 



genes in the same GO category. 

Only 11/28 SC have a high-scoring regulator with 
a significant score contribution from a non-root level, 
compared to 39/50 for yeast. ||Li et al. 2007b use the 
co-occurrence of Neurod6 and Hey2 in the SR regula- 
tion programs of SC-10, 15 and 27 to predict a cross- 
repression between Neurod6 and Hey2 with different 
modes of coregulation in different brain tissues. In 
the probabilistic regulation programs (PR), Hey2 is the 
highest ranked regulator for SC-10, consistently as- 
signed to the root level. However, at level 1, there 
are three equally good regulators Hes5 (overall rank 4), 
Neurod6 (overall rank 5) and Npas4 (overall rank 2). For 
SC-15, Neurod6 is the highest ranked regulator, consis- 
tently assigned to the root level, but the assignment of 
Hey2 at level 1 has a very low score (overall rank 4). 
For SC-27, we find consistent assignments of Hey2 at 
root level with overall rank 1 and Neurod6 at level 1 
with overall rank 2. Thus the cross-repression mecha- 
nism predicted by I Li et al. 2007 1 is supported only in 
the case of SC-27 and not SC-10 and 15. This example 
underscores the usefulness of an ensemble method to 
assess confidence levels of predicted interactions, espe- 
cially in cases with limited amount of expression data 
and no other validation sources available. 



III. CONCLUSIONS 



We have reexamined the module network method of 



(Sega Fet al.\ 2003 and compared an ensemble-based 
strategy to the standard direct optimization-based strat- 
egy. Ensemble averaging selects a subset of most infor- 
mative genes and builds a quantitatively better model 
for them. It finds functionally more coherent tight gene 
clusters and is able to determine the statistically most 
significant regulator assignments. The difficult prob- 
lem of identifying multiple regulators which explain 
together, but not separately, the expression of a mod- 
ule can be addressed in a reliable way. The ensem- 
ble method is thus able to deliver the promise to infer 
context-specific and combinatorial regulation through 
the probabilistic module network model. 



IV. METHODS 

A. Bayesian two-way clustering 

We associate to each gene i a continuous valued ran- 
dom variable X, measuring the gene's expression level. 
For a data matrix T> = (x, m ) with expression values for 
N genes in M conditions, the module network model 
of ( Segal et al. 2003} gives rise to a probabilistic model 
for two-way clusters, where a two-way cluster k is de- 
fined as a subset of genes Ak C {1, . . . , N} with a par- 
tition £/ c of the set {1, ...,M} into condition clusters. 
The Bayesian posterior probability for a set of coclusters 
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(A k , £ k ), denoted C, is given by 



where 
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where pix \ }i, x) is a normal distribution with mean u 
and precision t and p(f/, t) is a normal-gamma distribu- 
tion (see ( |Segal et aT} |2005 ) or ( TJoshi et al\ 2008) for more 
details). We use the Glbbs sampler strategy developed 
in ( Joshi et al. 2008) to sample multiple high-scoring co- 
clusterings from this posterior distribution. From these 
multiple solutions we extract tight gene clusters using 
the procedure outlined in ( [Joshi et al. |2008 1. It consists 
of a graph spectral method extractmg densely connected 
regions from the graph on the set of genes with edge- 
weights pij, the frequency that gene i and j belong to the 
same cocluster in each of the sampled solutions. 



B. Probabilistic regulatory programs 

For each set of conditions E in the condition partition 
£ k for a given module k we have an associated normal 
distribution with parameters (^E/ t e) which can be es- 
timated from the posterior distribution. Hence such a 
condition set can be interpreted as a discrete expression 
state for the module. A regulatory program 'predicts' 
the expression state of any condition in terms of the ex- 
pression levels of a small set IZj, of regulators, i.e., there 
is a conditional distribution 

p(x( | {x r ,r £ 1Z k }) = p(xi | He/Te)- 

The selection of an expression state is done by con- 
structing a decision tree with the states E £ £ k at the 
leaves. To each internal node t, we associate a regulator 
r t and split value Zf. In ( |Segal et ~aT. 2003) , the decision 



at the node is based on the test x Vt > Zf or x,- f < z f . 
Here we extend this model to allow fuzzy decision trees. 
More precisely, we sort the expression states E £ £ k by 
their mean y,£, and link this ordered set hierarchically. 
Then we can associate to each internal node a binary 
variable yt — ±1, where yt — —1 means 'decrease ex- 
pression state' (go 'left' in decision tree) and yt = +1 
means 'increase expression state' (go 'right' in decision 
tree). Again we also associate a regulator rt and split 
value Zf to node t, and a conditional probability 



p(y t | x n ,zt,fht 



1 



1 j te -Ptyt{x rf -z t ) ' 



(1) 



Given expression values x r for all r £ lZ k , we traverse 
the decision tree in a probabilistic fashion, taking the de- 
cision yt = ±1 at each node t by tossing a biased coin 
with bias eq. ([I). The original model with hard decision 
trees is recovered if fit = ±oo for each node. 

The conditional distribution or regulatory program 
now becomes a normal mixture distribution 

p(xi | {x r ,r £ 1Z k }) = Yi a E £ Tl k }) p(x.i | fi£,T £ ) 

Eee k 

(2) 



*e({w e TZ k }) 



t 



t ,Zt,$t) 



with the values yt determined by the unique path 
through the decision tree that ends at leaf E. 

For a cocluster (A k , £ k ) inferred from a data set T> — 
(xj m ) by the method summarized in the previous sec- 
tion, we can derive a posterior probability function for 
each regulator at each node t as follows. First note that 
each condition m belongs to exactly one set E in A k and 
hence determines a unique path through the decision 
tree, or in other words a set of values yt t m at each node t. 
Furthermore, each node t has an associated condition set 
Et consisting of the union of all condition sets E which 
can be reached from node t. Hence we can define at each 
node a posterior probability by 



Ppost[(r,z)] (x max( J] p(y t , m \ x r/in ,z,f) 



meEf 



(3) 



where for computational simplicity we maximize over 
/3 instead of marginalizing over a prior distribution. By 
allowing only a discrete set of split values, eq. (|3j be- 
comes a discrete distribution from which it is easy to 
sample. Typically, we consider as possible split values 
z the expression values x,-, m for m £ Et, but simpler 
schemes such as only allowing one or two split values 
can be used to reduce computation time for large data 
sets. 

The posterior probability eq. l|3j measures how well 
the expression values of a regulator 'predict' the parti- 
tion into two sets of Ef induced by the condition parti- 
tion £ k . We define the average prediction probability of 
(r, z) at node t by the geometric average 

Pt(r,z) = ( El Viyt,m I *J-,m,Z,/3max)J , (4) 

mGEf 

where /3 max is the maximizer in eq. l|3j. 



C. Regulator assignment score 

To assess the significance Zf (r) for assigning a regula- 
tor r to a node t in a certain regulation program, we use 
the average prediction probabilities (eq. B) and define: 



Zf(r) = wtY J Vti r ' z )- 



(5) 



I E I 

A typical choice for the weight factor u>t is u>t = -jjf, ex- 
pressing that we have more confidence in assignments 
to nodes supported on more conditions. The sum 
runs over the discrete set of split values for regulator r 
at node t. The overall significance Z(r) for assigning a 
regulator r to a module is defined by summing eq. l|5| 
over all nodes of all regulation programs for that mod- 
ule: 

Z(r)= E E Z fW- 
TeTteT 
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D. Model evaluation 



References 



For an experiment with expression levels {x\,..., Xjvj), 
we can evaluate the probability distribution 

N 

p(xi,...,x N ) = Ylv( x i I i x '-' r E fck(i)})> 
i=\ 

with k (z) the module to which gene i belongs and 72, j. the 
regulator set of module k, using the conditional distribu- 
tions (2). We only consider genes for which the model 
makes actual predicitions, i.e., genes belonging to clus- 
ters with a regulation tree. For the cross-validation ex- 
periment, we removed 10% of the conditions randomly 
from the total of 173 conditions. We learned module net- 
works on the remaining 90% data and repeated this pro- 
cedure 10 times. 



E. Data sets 

Yeast expression data for 2355 differentially 
expressed genes in 173 stress conditions, gene 
clusters, their regulators, split values and re- 
gression trees were downloaded from the sup- 

(jSegal 



plemental website of 
http: / /robotics.stanford.eduT^- 
MIPS functional catagories 



et 



20031 at 



erans/modulejnets / 
were downloaded from 
ftp://ftpmips.gsf.de/catalogue/annotation_data For 
TC and SC we calculated the p-value whether the 
overlap between a given cluster and a given functional 
catagory is statistically significant. We used data on 
genome-wide binding and phylogenetically conserved 
motifs for 102 transcription factors from ( jHarbison 



et al. 2004|. For a given transcription factor, only genes 
that were bound with high confidence (significance 
level a = 0.005) and showed motif conservation in 
at least one other Saccharomyces species (besides S. 
cerevisiae) were considered true targets. We also down- 
loaded all known regulator target interactions from the 
YEASTRACT database http:/ / www.yeastrac t.com| We 
calculated the p-value whether the overlap between a 
given cluster and a given transcription factor target set 
is statistically significant. 

Mouse expression data by (|Su et a/\{| 2004| was down- 
loaded from http: / /wombat.gnf .org and the data selec- 
tion and normalization was done as described in < fLT| 
eFflD 12007). 
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